Machine learning in small sample neuroimaging studies: Novel measures for schizophrenia analysis

Abstract Novel features derived from imaging and artificial intelligence systems are commonly coupled to construct computer‐aided diagnosis (CAD) systems that are intended as clinical support tools or for investigation of complex biological patterns. This study used sulcal patterns from structural images of the brain as the basis for classifying patients with schizophrenia from unaffected controls. Statistical, machine learning and deep learning techniques were sequentially applied as a demonstration of how a CAD system might be comprehensively evaluated in the absence of prior empirical work or extant literature to guide development, and the availability of only small sample datasets. Sulcal features of the entire cerebral cortex were derived from 58 schizophrenia patients and 56 healthy controls. No similar CAD systems has been reported that uses sulcal features from the entire cortex. We considered all the stages in a CAD system workflow: preprocessing, feature selection and extraction, and classification. The explainable AI techniques Local Interpretable Model‐agnostic Explanations and SHapley Additive exPlanations were applied to detect the relevance of features to classification. At each stage, alternatives were compared in terms of their performance in the context of a small sample. Differentiating sulcal patterns were located in temporal and precentral areas, as well as the collateral fissure. We also verified the benefits of applying dimensionality reduction techniques and validation methods, such as resubstitution with upper bound correction, to optimize performance.

and autism (Leming et al., 2020;Leming et al., 2021;McAlonan, 2004).These approaches are primarily based on features extracted from imaging of brain function, with fMRI and PET, and brain structure, using MRI from which grey matter volumes can be estimated (Gorriz et al., 2021;Wright et al., 1999) or morphological features extracted from the cerebral cortex (Jimenez-Mesa et al., 2020).A feature newly available from surface representations of the cortex are the sulcal (concave) and gyral (covex) folds (Campero et al., 2014).
Sulcal patterns offer particularly interesting features.They generally form in the last trimester and early life and remain broadly unaltered throughout adulthood, although the complex patterning of the cortex, whilst unique to the individual, strongly varies across individuals.They therefore potentially contain information about early development including the fetal and infant environment (Cachia et al., 2021).There are several approaches in the literature describing the detection, labelling and characterisation of sulci (Andreasen et al., 1994;Auzias et al., 2015;Beeston & Taylor, 2000;Behnke et al., 2003;Mateos et al., 2020;Yang & Kruggel, 2008).BrainVISA (Geffroy et al., 2011), is a software package which undertakes all these steps (Borne et al., 2020;Perrot et al., 2011).Other available packages include Freesurfer (https://surfer.nmr.mgh.harvard.edu)(Schaer et al., 2008) for detection and labelling combined with calc-Sulc (Madan, 2019) or BrainGyrusMapping (Murphy et al., 2014) for the calculation of characterising features of each sulcus.Each approach has its strengths and limitations (Mikhael et al., 2018).
Sulcal information has proven to be useful in the study of a wide range of conditions; for example, in Alzheimer's disease (Maciej Plocharski and Lasse Riis Østergaard, 2016;Mateos et al., 2020), Parkinson's disease (Wang et al., 2011), and anorexia (Collantoni et al., 2021;Wagner et al., 2003).Schizophrenia has a rich and wellreplicated literature establishing patterns of cortical change (Liu et al., 2020;Palaniyappan et al., 2018;Sallet et al., 2003;Zhang et al., 2012).Whilst there has been some work on both overall and specific sulcal information in schizophrenia (Csernansky et al., 2008;Janssen et al., 2022;Rollins et al., 2020), there has not, to our knowledge, been any exploration of the sulcal pattern as a way to classify individuals with schizophrenia from unaffected controls.
One of the main problems often encountered in conducting this type of study is the limited number of samples available.This is of particular concern when the number of features associated with each sample is very high; known as the curse of dimensionality (Gorriz et al., 2017).This is also a problem when applying classical statistics which make strong assumptions based on the sample conforming to the normal distribution.When the number of samples is small, it is not easy to accurately determine the distribution from which they are sampled and sometimes invalid techniques are implemented or inaccurate results are obtained (Eklund et al., 2016;Ioannidis, 2005).For this reason it is useful to consider other methods, such as data-driven approaches (Gorriz et al., 2021) based on artificial intelligence, both machine learning and deep learning (G orriz et al., 2020).A key benefit is to obtain insights similar to those obtained by parametric statistical approaches but without requiring the dataset to satisfy certain conditions.Furthermore, the black box problem, whereby there is no easy interpretation of the biological meaning of a classification result or understanding of the underlying decision-making process, is now being addressed with explainable artificial intelligence (XAI) algorithms (Gunning et al., 2019;Jimenez-Mesa, Arco, et al., 2023;van der Velden et al., 2022).
In this study, we explore the capacity of measurements of sulcal patterns to discriminate between patients with schizophrenia and controls.Initially, features relevant to this classification problem were identified by traditional univariate statistical methods.The accuracy of classification was then compared between machine learning classifiers of varying complexity with input features from prior multivariate analysis of identified features.Explainable machine learning techniques were also deployed to give a richer description of the pattern of casecontrol differences observed.This article is organised as follows.Section 2 provides a detailed description of the database and the preprocessing pipeline applied for extraction of sulcal features.Section 3 describes the methods applied in this study, including feature selection and extraction, classification algorithms, validation methods and XAI techniques, among others.
Then, in Section 4 we evaluate the results obtained.Finally, outcomes and future study are discussed in Section 5, and conclusions are drawn in Section 6.

| Database
Data used in this study consisted of MRI from 65 (27 females) Han Chinese patients with schizophrenia (SCZ) and 57 (24 females) unaffected controls (HC).Participants were recruited from the Shanghai Mental Health Centre and the data set was published and analysed in (Li et al., 2018).All participants provided a written informed consent.
The study was approved by the Ethics Committees of the Shanghai Mental Health Centre and the Institute of Psychology, the Chinese Academy of Sciences.
Participants underwent structural neuroimaging as well as a clinical evaluation.High-resolution T1-weighted structural images were acquired on a 3T MRI scanner with 1 mm isotropic voxel size.Details of the acquisition parameters are given elsewhere (Li et al., 2018).No non-linear spatial normalization was applied to the scans to avoid possible bias generated from shape deformations of the sulcal patterns (Cachia et al., 2008;Mellerio et al., 2016).

| Data preprocessing
The MRI scans were processed using BrainVISA 5.0.4 (Geffroy et al., 2011) to extract sulcal features by means of the Morphologist 2021 pipeline (Borne et al., 2020;Perrot et al., 2011).Information was obtained from 62 areas per hemisphere (123 in total; the sulcus of the supra-marginal gyrus is only defined in the left hemisphere).In each region, the features measured in Talairach space (Louis Collins et al., 1994;Talairach, 1988) were length, depth (average and maximum), fold opening, medial surface of the cortical folds and grey matter thickness (Jin et al., 2018;Pizzagalli et al., 2017).The average and maximum depth and length were calculated as features.Other available features are associated with morphological parameters rather than surface topology.Figure 1 shows an example of an automatically labelled brain by BrainVISA and the features extracted from a specific region.
In some cases, a particular sulcus could not be identified or was misdetected by the Morphologist 2021 pipeline.Therefore, samples with more than 18 of these events (15% of the total number of regions) was excluded from the analysis.Insula (left and right) regions were also excluded because of the high possibility of being misdetected due to its peculiar shape.After these exclusions, any region that still had at least one misdetection across remaining participants was excluded.Finally, features were normalised to zero mean and standard deviation 1.Individuals with any feature with values >6 times the standard deviation were removed.These exclusion criteria result in 49 remaining areas for analysis, which are shown in Figure 2.
The final number of individuals (samples) was 114, the demographics for whom are shown in Table 1.It can be seen that the sample set was matched for size, sex and age, with a sample size of 58 SCZ patients and 56 HC.

| Feature analysis and selection
Following preprocessing described in Section 2.2, sulcal length and maximum and mean sulcal depth were tested by univariate statistical methods to identify features important to classification.Both parametric and non-parametric techniques were considered.

| Parametric techniques
Initially, the Shapiro-Wilk test (Shapiro & Wilk, 1965) was applied to identify which features obeyed a normal distribution, since the null F I G U R E 2 The 49 regions from the BrainVISA sulcal atlas (Perrot et al., 2011) used in this study.All other regions were excluded due to sulcal misdetection.
hypothesis is that samples come from a normally distributed population.
For those features where a normal distribution was followed, a two-sample t-test (Kim, 2015;Welch, 1947) was applied to detect the relevance of the feature to distinguish between schizophrenia and control participants.To compare the importance of features, the p values (Panagiotakos, 2008) associated with the tests were used.
Those features that did not follow a normal distribution were assessed with the Mann-Whitney U test (Fay & Proschan, 2010;Mann & Whitney, 1947), and the corresponding p value used.

| Non-parametric techniques
The importance of a feature to classification was also evaluated by means of an AI-based approach: Statistical Agnostic Mapping (SAM) (Gorriz et al., 2021).First, each feature was independently fed into a supervised classification model.Then, accuracies obtained for each feature were sorted based on a proportion test.The null hypothesis of the test is that the population proportion is similar to a particular proportion, π 0 , given a confidence interval.The test statistic for each feature was estimated as: where b π is the accuracy related to the feature, and . In this last expression, l is the number of accuracies higher than π 0 .In our case, π 0 is the mean of all features' accuracies, both empirical (derived from samples) and actual (related to an infinite sample set).Once the z-statistics were calculated, the p value of each statistic was estimated.For this, the null hypothesis was considered to be true and therefore the test statistic follows a standard normal distribution.From this p value, the most relevant features of the study were determined.

| Feature extraction
Along with feature selection, features were also processed to generate more compact information and reduce the dimension of the feature vector.Partial Least Squares (PLS) (Wold et al., 1984) is a supervised method which allows dimensionality reduction while retaining the patterns for higher separability of the classes.Given a matrix of features, X lxm , where l is the number of samples and m the number of features, and a vector of labelsY lx1 , PLS generates a matrix of loadings X l , which is related to the initial data by the following linear combination: where X s is the score matrix and E the assumed error matrix.The reduced d-dimensional space desired comes from the dimensions of X l (m x d), as m > d.This new reduced space contains the original information of X.

| Classification
Once the features to undertake the classification were selected, the next stage was classification.For the binary classification problem posed in this study, both machine learning (ML) and deep learning (DL) methods were applied.
The ML algorithm implemented in this study was a Support Vector Machines (SVM) classifier with linear kernel (Schölkopf & Smola, 2002).This combination was chosen for its easy explainability as well as its propensity to generate excellent results in neuroimaging (Javier Ramírez et al., 2013;Jimenez-Mesa et al., 2020;Orru et al., 2012).This supervised algorithm establishes the maximummargin hyperplane which separates the samples of the different classes.In the case of a linear binary problem, the set of points, x, that generate the hyperplane satisfy: where w is the normal vector to the hyperplane and b represents the error.The classification is done in such a way that samples on one side of the hyperplane belong to one class, and samples on the other side are associated with the second class.
The selected DL architecture was a multilayer perceptron (MLP) as both the number of samples and features was small.Additionally, a two-dimensional vector favours MLP over convolutional neural network (CNN).The MLP is a feedforward artificial neural network (ANN) composed of fully connected layers.Each layer has i perceptrons which are connected in a forward direction to the perceptrons of the next layer, but with no connections between perceptrons of the same layer.Given a layer n, the output value of each perceptron is computed as: where f Á ð Þ is the activation function applied to the i-th perceptron.
This function is applied to the result of multiplying the weight vector, w n i , and the activations of the previous layer, y nÀ1 , in addition to an associated bias, b n i .The network configuration implemented is shown in Figure 3.
The number of epochs involved in the training was 18, with a batch size of 1.The optimizer selected was Adam with a learning rate of 0.001, and the stopping criterion computed as the cross entropy loss with balanced weights.

| Validation procedure
Two validation methods were used to assess the performance of the classifiers.First, a 10-fold stratified cross-validation scheme (Kohavi, 1995) was applied, which guaranteed independence between training and test samples.The sample was randomly divided in to a set in 10 folds, and for 10 iterations used one of the folds as test samples and the remaining folds as the training set.For the computation of the performance metrics, the mean and standard deviation of the values obtained in the 10 iterations were used.
The second validation method was an upper bound-corrected resubstitution (Vapnik et al., 1994), which is referred to in previous work as RUB (Jimenez-Mesa, Arco, et al., 2023;Jimenez-Mesa, Ramirez, et al., 2023).One way to define the upper bound would be as the difference between empirical and actual errors given a fitted Þj, or in terms of the previous approach, the difference between the training error and the test error.Thus, the entire database was used as the training set for the classifier, that is, resubstitution was performed, and then the actual accuracy was obtained by means of the upper bound.This could be considered a theoretical classification limit that allows the use of all accessible data to compute the metrics of interest.In addition to accuracy, other metrics such as sensitivity or specificity can also be of limited value since their errors are related to the classification error.
Different upper bounds are described in the literature.The most well-known is based on the VC dimension as proposed by V. Vapnik (Vapnik et al., 1994).In this article, an upper bound based on the assessment of concentration inequalities was applied (G orriz et al., 2019).This bound is only applicable to linear classifiers, for example, SVM with linear kernel, and its expression is: where n is the number of samples used, size of the sample set, d is the feature's dimension, and η is the significance level.In this study, the significance level was set as 0.05.

The implementation of probably approximately correct (PAC)-
Bayesian bounds is another interesting proposal.In this study, a dropout bound (McAllester, 2013) was analysed.This bound considers a dropout rate, α 0, 1 ½ , which reduces the complexity cost of the function.The effect of this dropout is stronger the closer its value is to 1.
The expression of this bound in the scenario proposed in this study is: F I G U R E 3 Scheme of the MLP composed of four layers: input layer, two hidden layers and the output layer.AF, activation function.
where k different values of the parameter λ, which was set to 1=2 ≤ λ ≤ 10, were evaluated to minimise the bound.The estimated value of the loss function to be bounded, that is, the error of the classifier, is b L Q ð Þ.Its maximum value, L max , which must be a real number, is 1 in this case.Finally, Θ was the classifier's parameter set.

| Explainable AI
Algorithms that give a qualitative understanding of performance are key to extracting domain information from classification tasks.This emerging field is referred to as explainable artificial intelligence (XAI).
Here, apart from considering the performance of the classifier, two of these techniques were used to analyse the influence of features on the decision making by the classifiers.
Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016)  This algorithm is able to explain any prediction model f locally.
This means that LIME provides explanations for a particular sample x, since globally faithful explanations are still a challenge for complex models (Ribeiro et al., 2016).To do this, the algorithm selects an explanation model g G, where G is a class of potentially interpretable models.The selection is made according the following objective function related to the faithfulness of the explanation model: where interpretability and local fidelity is ensured by minimising the trade-off between the loss related to the discrepancy between g and f given the local kernel π x , and the complexity of g, measured by Ω g ð Þ.
SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017) is a model-agnostic algorithm which can explain any classification model.
SHAP assigns the relevance of each feature by means of Shapley values, a concept from game theory (Shapley, 1953).
Given the set of features S, the contribution of each feature s is estimated on the basis of its average marginal contribution to all subsets of features T ⊆ S, which do or do not include the feature s.Let the prediction of the model given a particular sample and a subset of features be denoted as f x T ð Þ.The marginal contribution of the feature s is estimated as the difference in predictions when applying or not applying such a feature, . So, the Shapley value, ϕ s , is computed considering all possible subsets T ⊆ S ∖ s f g: SHAP values are the solution to Equation ( 8 Both techniques have generated interesting results in previous neuroimaging studies (Lombardi et al., 2021;Lombardi et al., 2022;Scheda & Diciotti, 2022), where their application allows observation of the congruence of the explanations and their usefulness.

| Performance evaluation
Performance of the classifiers was evaluated through metrics extracted from the confusion matrix, where the positive class was SCZ.These metrics were balanced for accuracy, specificity and sensitivity.Their equations are: BalAcc ¼ 1 2 where T P refers to the number of participants correctly classified as SCZ (true positives), T N corresponds to the number of controls correctly identified (true negatives), F P quantifies the number of controls misclassified (false positives), and F N quantifies the number of SCZ participants misclassified (false negatives).
The receiver operating characteristic (ROC) curve was also constructed.The area under the ROC curve (AUC) evaluates the ability of the model to differentiate between the two classes (Hajian-Tilaki, 2013;Mandrekar, 2010).

| Summary of the procedure
The several stages of this work are depicted in Figure 4.In summary, once data were preprocessed as described in Section 2.2, two different scenarios were implemented: feature selection (see Section 3.1), which highlighted the relevance of sulcal features, and feature extraction (see Section 3.2), which generated a reduced set of features to make the best possible use of the information extracted from the original data.From the features highlighted or generated by both approaches, a classification stage followed where a ML model (SVM) was applied using various validation methods, as described in Sections 3.3 and 3.4.Lastly, the classification model (MLP) obtained when the 147 preprocessed features were used was analysed by means of the XAI described in Section 3.5.
Three respective features (length, mean depth and maximum depth) were extracted from 49 brain regions.The total number of features in this study was 147, a value larger than the number of samples, 114.This is an undesirable, but very common situation in neuroimaging.

| Parametric feature selection
To analyse the relevance of the features, firstly we followed a parametric approach.The Shapiro-Wilk test for normality determined that among the 147 features 125 followed a normal distribution, while the remaining 22 did not.The top row of Figure 5 shows examples of histograms of three features that did not follow a normal distribution.It can be seen that the main reason for this was the long tails skewing the distribution.
By visual inspection selecting eligible samples, the number identified was adequate for a two-sample t-test with all the features.The Mann-Whitney U test was used for non-normally distributed features.
The significance of each feature was assessed with the p value obtained in their respective tests.The bottom graph in Figure 5 shows a boxplot of the nine most relevant features according to the tests applied.Only the first five had a p value <0.05, while 10 of them had a p value <0.1.
F I G U R E 4 Flowchart of the study.After preprocessing the data, two independent feature selection and feature extraction analyses were conducted.The information extracted from both was fed into a ML classifier.Two validation methods were applied.Finally, the 147 preprocessed features are fed into a multilayer perceptron (ML) model.The classifiers' performance was analysed by means of XAI techniques.

| Non-parametric feature selection
Using the SVM algorithm with linear kernel and a validation based on the resubstitution with upper bound correction, 1000 permutations were applied by randomly modifying the position of the samples in the set and calculating the mean accuracy for each feature.For accuracy estimation, a balanced estimator with class weights balancing was applied during the classifier training.The p value associated with each feature was assessed with a significance test for a proportion.
The nine most relevant features obtained are shown in Figure 6.

| Most relevant features analysis
Figure 7 shows the most relevant features ranked by their p value in both approaches (parametric and non-parametric).Relevant features had a p value <0.05, identifying nine features.Three features appeared with both approaches, together with two from the parametric and four from the non-parametric approach.These include both depth-related and length-related features.

| Use of reduced dimensionality in classification
Instead of analysing the relevance of the features independently, it is possible to analyse the relevance of the feature set for the casecontrol classification.To do this, a feature extraction stage was implemented by applying PLS to the original data.The reduced feature dimension was then classified with a SVM classifier with a linear kernel.Performance was analysed using cross-validation (K-fold) and RUB as validation procedures.Figure 8 (left) shows results how the F I G U R E 5 Statistical features analysis.Top row: histograms related to non-normal distributed features.Bottom row: boxplots of the nine most relevant features according to the Two-sample t-test and the Mann-Whitney U test, depending on whether the feature follows a normal distribution or not.These features are arranged from Frontal lobe to Occipital lobe (from left to right and from top to bottom).Average and maximum depth are abbreviated as meandepth and maxdepth.classifier's performance varied according to the PLS components used.
The upper bound applied in RUB, as detailed in Equation ( 5), depends on the number of features, the fixed number of samples ( 114) and significance level (0.05).Although there were no discernible trends in performance as a function of number of PLS scores using K-fold, a decreasing trend was observed applying RUB and higher accuracies F I G U R E 6 The nine most significant features obtained by a classification approach.Their related accuracy was estimated as the mean value of 1000 permutations shuffling the samples and using a SVM with lineal kernel classifier and resubstitution with upper bound correction (RUB) as a validation approach.Resubstitution accuracy stands for the empirical accuracy obtained before the upper bound is applied.The p values related to each region were estimated using a test of a proportion.Average and maximum depth are abbreviated as meandepth and maxdepth.
F I G U R E 7 Features under analysis with a p value <0.05 in any of the parametric and non-parametric tests.These features are arranged from Frontal lobe to Occipital lobe.The significant features under the parametric analysis are coloured cian, non-parametric analysis are coloured magenta, or if both they are coloured green.Average and maximum depth are abbreviated as meandepth and maxdepth.
F I G U R E 8 Left: Performance of the SVM classifier along with PLS as the feature extraction technique.Results are shown for a wide range of PLS components (1-20).Right: performance of the SVM classifier using four PLS components for several balanced samples sizes (20, 30, 40, 64, 88 and 112).In both cases: RUB (orange line) and 10-fold CV (box-plots).Resubstitution accuracy (black line) stands for the empirical accuracy obtained before the upper bound is applied.
were obtained using fewer components (<6).Conflating both approaches, four PLS components were selected and results are illustrated in Figure 8 (right).

| Impact of sample size
To understand the effect on performance of sample size, four PLS components were chosen as input features for the classifier.The results are shown in Figure 8 (right).Theoretically, accuracy should increase as the sample is enlarged, but this was not the case with K-fold.The upper bound applied in RUB changes according to the number of samples, with a fixed number of features, 4, and a significance level, 0.05.

| Various classification scenarios
Case-control classification was undertaken with the features selected with the parametric, non-parametric and ensemble approaches as well as PLS features.The testing of the classification results were obtained by performing 1000 permutations of the dataset, which are shown in Table 2.Note that for the computation of the upper bound of Equation ( 5), the RUB validation approach took into account the number of samples, 112 (as the data was balanced in each iteration), the number of features (9 or 4, depending on the case), and the significance level (0.05).This gave values for the upper bound of 0.3695 per unit or 36.95% for nine features and 0.2675 (26.75%) for 4. The reader is reminded that these values must be subtracted from the accuracy rate obtained in order to determine the actual worst-case accuracy rate.While K-fold CV worked reasonably well with the extracted features, especially with those obtained with the parametric method, RUB had improved performance with the PLS components due to fewer input features, and thus a tighter upper bound.This is especially true when extracting the main components of the full set of features.

| Comparison of upper bounds
A PAC-Bayes upper bound was applied under the same experimental conditions to test its performance against the upper bound based on concentration inequalities.As this different bound depends on the dropout rate, see Equation ( 6), several values of dropout were applied: 0, 0.25, 0.5, 0.75 and 0.95.The results are shown in Figure 9, where the dashed horizontal lines represent the value shown in Table 2 with the RUB approach.

| Examining predictions with XAI
The same classifier was tested using the 147 features as input features.A summary of the performance results are shown in Table 3.
Accuracy values were below 50%.With the same features as input, the MLP achieved a 58.83% accuracy on the test set by applying CV.By applying all the features as input, we observed from explainable artificial intelligence techniques the main focus of the algorithm.
Due to the better performance obtained using MLP, the subsequent results are associated with this classification model.Note: Results using four PLS components as input to the classifier are also included when they are extracted from all 147 and the nine globally significant ones.Upper bounds related to this analyses were 0.3695 (9 features) and 0.2675 (4 features) for a significance level of 0.05.indicate some dependency between features.At the bottom, it is observed that higher values of left inferior frontal sulcus maximum depth, that is, deeper values, bring the sample closer to the SCZ class (higher SHAP value).In the last graph, which includes the comparison between left anterior lateral fissure maximum and mean depth, there is a correlation between the two features, since samples with low values of the maximum depth also have a lower mean depth.

| DISCUSSION
In this study, a staged approach of statistical, ML, and DL techniques were applied to perform an analysis of sulcal patterns in a casecontrol comparison of schizophrenia.Feature calculations were performed by BrainVISA, where a 3D U-Net convolutional neural network was implemented to the labeling of sulci (Borne et al., 2020).
Subsequently, sulcal length and depth were selected as features.
These features were standardised and independently tested with parametric (t-test) and non-parametric (data-driven) approaches.
Machine and deep learning algorithms were applied to classify SCZ patients from HC, and its predictions are evaluated by XAI techniques.
Unlike most work on sulcus patterns, the features applied in this study are extracted fully automatically and encompass the entire cerebral cortex.The sulcal dectection processes remain in their early development, as it is still very difficult to correctly label all the sulcal patterns, especially those that are small or peculiarly shaped (Maciej Plocharski and Lasse Riis Østergaard, 2016).In the dataset used, the amount of detection failures obtained was high, thus reducing the number of sulci and the number of subjects finally included in the study.This made it impossible to study some high-interest regions such as the left hemisphere paracingulate sulcus (Rollins et al., 2020), while the right hemisphere pair is represented in Figure 1 as right Accuracies obtained with the RUB approach using two different upper bounds.The dashed horizontal lines are the accuracies obtained with the upper bound based on concentration inequalities (Equation ( 5)).Accuracies with markers are those with the PAC-Bayes bound (Equation ( 6).The classifier applied was SVM using the nine extracted features in the parametric, non-parametric and both analyses, and four PLS components.Accuracies shown are the mean values after 1000 permutations.
T A B L E 3 Classification performance of models based on SVM and MLP when the 147 features (the complete set) were fed as input of the classifier.Cross-validation was used as validation approach (10-Fold CV).similar values for the extracted features were not achieved.For these reasons, the literature includes a large number of works which combine automatic extraction and manual revision (Janssen et al., 2022;Liu et al., 2020;Shen et al., 2018), apply manual segmentation (John et al., 2006) or reduce the study to a concrete number of regions of interest (Jin et al., 2018;Yang et al., 2019).
Most significant features obtained in this study reflect a similar importance of length and depth, as it can be seen in Figures 5-7, albeit slightly higher in the case of depth.Regarding the relevance of using the maximum or mean depth, their occurrence in the most significant features is practically identical.However, given the same feature, both are not necessarily equally relevant.According to the upper right graph in Figure 11, while the correlation between maximum depth of the intermediate precentral sulcus and its effect on classification is inversely proportional, the mean depth has no direct relationship with maximum depth.
The hemisphere most represented in these findings is the left hemisphere, which is consistent with other studies (Cachia et al., 2008;Liu et al., 2020;Ribolsi et al., 2014;Rollins et al., 2020).
Both the length and depth of the sulci in this hemisphere tended to be smaller in SCZ subjects, as observed previously (Cachia et al., 2008).For example, in line with previous studies, Figure 11 (left) shows a negative correlation between the intermediate precentral sulcus and the disease (Nesvåg et al., 2014;Palaniyappan et al., 2014).
Nevertheless, differences were also found in the right hemisphere, which is aligned with hemispheric symmetry previously discussed in the literature (Csernansky et al., 2008), and as can be seen in Figure 10, where both left and right values were relevant in the classification.Nevertheless, there is something noteworthy in the relevance of the temporal region and that is that there was no decrease in the length values of this region for those from the SCZ class.There was a decrease in the value of maximum depth in the superior temporal sulcus in SCZ patients, which is consistent with previous study (Rollins et al., 2020).In fact, this feature is one of the most relevant obtained in the non-parametric approach, see Figure 6.
Several other features associated with the temporal cortex can be seen in Figure 5.Of these, only the posterior terminal ascending branch of the superior temporal sulcus (S.T.s.ter.asc.post) had a lower average value for SCZ samples.This is also seen in Figure 11 Carr, et al., 2015;Yang et al., 2019;Yücel et al., 2002).However, it is impossible to draw clear conclusions about this area in this study due to the elimination of most of its features at the preprocessing stage by the failure of the sulcal detection software.It is possible that this occurred because the surface morphology of this region varies greatly from one subject to another making it difficult to classify automatically.For this reason, the literature tends to undertake manual detection of this sulcus (Garrison, Fernyhough, McCarthy-Jones, & Haggard, 2015;Rollins et al., 2020).
The limitations of this study include the reduced number of samples available.With a larger sample size, the results obtained could be strengthened and subtle changes in sulcal dimensions could be analysed in more detail.This is especially important when applying deep learning, as shown in its performance in Table 3.When introducing the 147 features, the network, although not excessively complex, was not able to obtain robust classifications due to a lack of samples.
Therefore, in order to optimise the information extracted from the available data and to avoid the curse of dimensionality (samples vs. features ratio) (Gorriz et al., 2017), in addition to the widely used cross-validation, resubstitution with upper-bound correction was also adopted (Jimenez-Mesa, Ramirez, et al., 2023;Vapnik, 1982).
This approach allows better performance to be obtained in small sample sizes, especially when the number of features is very Jimenez -Mesa, Ramirez, et al., 2023).This is because it takes advantage of all the samples in the set to fine-tune the classification approach (resubstitution), adjusting the results a posteriori without bias (upper bounding).This is clearly seen in the PLS component and sample size studies in Figure 8.For example, while the performance using CV was very similar for different PLS values, RUB managed to improve performance when the number of components was small.
With the sample size used in this study something similar happened.
RUB managed to improve the results with increasing sample size, while CV remained inconsistent for any sample size.The former was expected, since by increasing the set, the classifier's learning should theoretically improve.
The contrast between validation approaches is also seen in Table 2.In this case, by working with a slightly larger number of features, 9, the upper bound obtained to apply in RUB was large, and therefore better results were obtained by applying K-fold.Conversely, when the number of features was 4 (PLS column), the best performance was again achieved by using RUB, irrespective of the upper bound applied, see Figure 9.In this figure, when using K-fold, the generalisation capacity of the algorithm was lost.RUB managed to maintain results close to those obtained with the most relevant features in the first columns.Consistently better results were obtained using only the most relevant features compared to dimensionality reduction techniques.This is because in the feature extraction process, all analysed regions were included.
The results in Table 2 also indicate better results when using the features selected by parametric rather than non-parametric methods.
The difference in accuracy is <4%, so both methods were feasible to use.This suggests that in the absence of normal distributions or with a reduced sample size, non-parametric techniques are a tempting option.
However, Figure 7 shows how both methods report relevant features.
Future study will expand the analysis to include the interaction between features, as well as comparisons between sulcal and gyral morphological features.For this purpose, further processing tools will be tested, such as calcSulc and Freesurfer with a multidisciplinary working group, in order to be able to analyse in detail all the results obtained.It would also be useful to expand the database to be able to verify the results obtained on an independent dataset.It would even be highly interesting if such an extension could include databases from different regions in order to be able to detect the environmental impact on schizophrenia.Moreover, more specific studies could be conducted, such as the identification of patterns in those who suffer from hallucinations.

| CONCLUSION
In this study, we evaluated the potential of several ML and DL tech-

1
Example of a brain with sulci regions automatically labelled by BrainVISA using Morphologist 2021 pipeline (right).The central sulcus is highlighted (middle) and indicates how length and depth are measured in a region (left).
to identify qualitative patterns on the most relevant features according to a classifier that distinguished case and control classes.Four examples of individual explanations are shown in T A B L E 2 Performance of the SVM classifier using the nine extracted features in the parametric, non-parametric and both analyses after 1000 permutations.

Figure 10 .
Figure 10.These examples are related to correctly classified HC (top row) and SCZ (bottom row) test samples by the MLP.For this analysis, the 10 most relevant features in the classification for each sample were selected and displayed sorted from most to least importance according to LIME.All four major lobes of the brain appear in this analysis, although the Temporal and Frontal lobes have greater representation.The same applies to the three types of features related to length and depth.Several features included in Figure 7 as the most relevant features, according to parametric and non-parametric approaches, were also relevant in this analysis.One prominent example was the length of right posterior occito-temporal lateral sulcus.In the top right sample, a low value of the length of right posterior occito-temporal lateral sulcus decreased the probability of being associated to HC class, while in the bottom right sample, a similar low value increased the chance of being classified as a SCZ patient.

Figure 11
Figure 11 also includes several dependency plots of these most relevant features.Dependency plots illustrate the relationship between the SHAP value and the magnitude of the feature.A second feature reflected in the colour of the samples is included, which may by the association of high values of inferior temporal sulcus features with the SCZ class.On the contrary, the length of posterior occitotemporal lateral sulcus was associated with smaller values for the SCZ class, see Figure 10.As mentioned above, one of the most important regions for the study of schizophrenia is the medical surface of the brain around the cingulate sulcus (Garrison, Fernyhough, McCarthy-Jones, Haggard,F I G U R E 1 0 Local explanations extracted from LIME for the schizophrenia (SCZ) and control (HC) classes.Features in green represent values that increase the chance of being classified as the class under analysis.Features in red reduce it.Top row: explanations for two correctly classified HC test samples.Bottom row: explanations for two correctly classified SCZ test samples.To improve comprehensibility, length, mean depth and maximum depth are underlined in red, green and blue, respectively.On the left side, the letters F, T, P and O represent the feature belonging to Frontal, Temporal, Parietal or Occipital lobe, respectively.Average and maximum depth are abbreviated as meandepth and maxdepth.
small (ideally 1) (Castillo-Barnes et al., 2020;Gorriz et al., 2021;    F I G U R E 1 SHAP charts, where point represents an instance of the test sample.Top left: Summary plot of features importance in the classification decision; the 10 most relevant are shown.To improve comprehensibility, length, mean depth and maximum depth are underlined in red, green and blue, respectively.Letters F, T, P and O represent the feature belonging to Frontal, Temporal, Parietal or Occipital lobe, respectively.Top right and bottom: Dependence plots of some relevant regions according to their SHAP values.Colour in the graph corresponds to the value of a second feature for that same sample.The positive class is SCZ.Average and maximum depth are abbreviated as meandepth and maxdepth. niques combined with sulcal features to undertake a novel casecontrol classification task with schizophrenia patients.Sulcal features were obtained automatically through BrainVISA and encompass the entire cerebral cortex.These features were analysed using techniques of feature selection and extraction, considering parametric and non-parametric approaches.Then, different classification scenarios were implemented to evaluate the relevance of the features and the performance of the validation methods (resubstitution with upper bound correction and K-fold cross-validation) given the circumstances of the study, both in terms of number of samples and features.Explainable artificial intelligence techniques were also applied to detect regions of interest in schizophrenia and to compare their findings with those obtained from feature selection techniques.The performance achieved reflects potentially interesting features that have not previously been reported in terms of length and/or depth, such as the collateral fissure or the superior postcentral intraparietal superior sulcus.Moreover, expected results are obtained in temporal or precentral areas.This study makes manifest the issues involved with classification tasks using novel features obtained from small sample-size datasets.The techniques described give a roadmap for how researchers might approach a similar problem, and indicates how dimensionality reduction (feature extraction) techniques and validation methods such as upper-bounding resubstitution help to mitigate the inherent difficulties.

1
Demographic details of the participants included in the analysis.
(Lundberg & Lee, 2017)h as local accuracy, missingness and consistency(Lundberg & Lee, 2017).Several approximation methods to compute SHAP values are proposed, since its exact computation is difficult to achieve.The one applied is this study was Kernel SHAP, a model-agnostic approximation which combines Shapley values and linear LIME (local linear regression) to estimate the importance of each feature.To do this, the solution of Equation (7) are ), that is, they are Shapley values of a conditional expectation function of the original model which